Lattice QCD based on OpenCL 



Matthias Bach b , Volker Lindenstruth b , Owe Philipsen a , Christopher Pinke a 

a Institut fiir Theoretische Physik, Goethe- Universitdt, Max-von-Laue-Str. 1, 60438 Frankfurt am Main 
b Frankfurt Institute for Advanced Studies / Institut fiir Informatik - Johann Wolfgang Goethe- Universitdt, 

Ruth- Moufang-Str. 1, 60438 Frankfurt am Main 



Abstract 

We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge 
case and Wilson fermions in the twisted mass formulation. The implementation is platform independent 
and can be used on AMD or NVIDIA CPUs, as well as on classical CPUs. On the AMD Radeon HD 
5870 our double precision p implementation performs at 60GFLOPS over a wide range of lattice sizes. 
The hybrid Monte-Carlo presented reaches a speedup of four over the reference code running on a 
server CPU. 
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1. Introduction 

Lattice Quantum Chronic-dynamics (LQCD) is the only a priory approach to describing the strong 
force. State-of-the-art lattice simulations require high-performance computing and constitute actually 
one of the most compute intensive problems overall. With processor clock speeds no longer improving 
and processors instead increasing their core count, Graphics Processing Units (GPUs) with their 
high peak performance and bandwidth have become an interesting platform for high-performance 
computing. In June 2011 three of the top five systems in the Top500 list of supercomputers [5] were 
GPU accelerated clusters. An example of heterogeneous architecture for general purpose computing 
is the LOEWE-CSC [3], which solely consists of AMD hardware and provides two 12-core AMD 
Magny-Cours CPUs and one AMD Radeon HD 5870 GPU in the majority of its nodes. Originally, it 
was ranked 22nd in the Top500 list of supercomputers [2] and ranked eighth in the Green500 list of 
energy-efficient supercomputers (with 718 MFLOPS/Watt) g]. 

Originating in the high-end computer gaming market, GPUs nowadays offer highest computing 
capabilities at a very attractive price-per-flop ratio. Current high-end gaming GPUs by AMD and 
NVIDIA are priced at about 500 Euros. However, while the pioneer work in the field [5|, using 
application programming interfaces (APIs) designed for graphics rendering, was in principle platform 
agnostic, nearly all later developments in the field were based on NVIDIA CUDA [5J, therefore being 
limited to hardware by this single GPU vendor. 

We present the first application of OpenCL [7] to Wilson fermions, enabling the code to be run on 
AMD and NVIDIA GPUs and also on classical CPUs. The work, of which early prototypes have been 
shown in pQ, has been extended to a full hybrid Monte-Carlo application that shows major performance 
gains over a purely CPU based reference code. 

We begin by stating the physical problem we want to solve and then explain the tools used. 
Afterwards we will describe the important parts of our implementation and end with an analysis of 
the performance of our application. 

2. Lattice QCD and Monte Carlo simulations 

The strong interactions between elementary constituents of matter are described by Quantum Chro- 
modynamics (QCD). In this section, we will give an introductory overview of QCD and its evaluation 
by means of Monte-Carlo methods. For more detailed information, we refer to [HUH] • 

QCD is a SU(Nc) gauge theory consisting of gauge and fermion fields. An analytical access by 
means of perturbation theory may be applied only in regions where the coupling constant, g, is small. 
This is ensured for high momentum-transfer ( "asymptotic freedom" ) . To explore the non-perturbative 
regime, euclidean spacetime is discretized on a hypercube with lattice spacing a. Accordingly, the 
QCD action <Sqcd is replaced by a lattice version afflicted with discretization errors, 



and continuum physics can be obtained in the limit a — > 0. 

In statistical physics, the central object is the partition function Z of the system, which allows for 
the measurement of an observable A of interest. On the lattice, the expectation value of A is then: 



where the exponential in the first line is the Boltzmann factor if one identifies Slqcd = fiH. The 
grassmann-valued fermion fields ip can be integrated out exactly, yielding the determinant of the 



Slqcd = £qcd + aS\ + a <S 2 + ■ • ■ , 
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Figure 1: Sketch of lattice discretization at lattice spacing a. w,x,y,z denote lattice sites, ip a fermion field and U a 
gauge link, respectively. Also shown are the plaquette P tM/ {y) and the rectangle product of links i^„(z). 



fermion matrix D, and one is left with an integral over all possible gauge configurations U, which are 
the relevant degrees of freedom. It is convenient to rewrite the fermion determinant as an integral over 
a bosonic field <j> (pseudo fermions), 

det D[U]~ J Vcf>exp{-$D- 1 [U](t>} , (3) 

giving an effective action S c s[U, (/>} = S gaugc [U] + <$D~ X [U]4>. Since it is unfeasible to solve this high- 
dimensional integral, importance sampling methods are used to generate an ensemble of N gauge 
configurations {U m } using the Boltzmann-weight p[U,<p] = cxp {— 5 c ff [U, <fi]} as probability measure. 
Then, (A) can be approximated by 

2.1. Choice of Lattice action 

There is some freedom in the specific form of the lattice action, since it has to agree with QCD in 
the continuum limit only. Different discretizations will suffer differently from discretization artifacts 
and often also do not hold all the continuum properties. Especially the discretization of the fermion 
matrix is non-trivial, since the naive one gives rise to non-physical fermion copies ( "doublers" ) . It is 
also possible to add improvement terms irrelevant in the continuum to the lattice action, which help 
to reduce discretization errors. 

At this point, it is adequate to insert some words related to the dimensions of the quantities involved 
and to fix our notation. We will denote the spatial and temporal extent of the system with N CT and 
N r , respectively, so that the total volume of the system is V to t = x N T . The number of colors, 
Dirac indices and spacetime dimensions are denoted as Nq, Nnirac and Nd, respectively and they 
take on the values 3, 4 and 4. On each lattice site x, the component <p(x) of is a complex valued, 
(NDirac x Nc)-dimensional vector, whereas the gauge field U = U^{x) is a complex valued Nc x Nc 
matrix linking x and its neighbour x + fl in spacetime-direction fl (therefore called links, see fig. [I]). 

In our numerical studies [H5H12] we employ the tree-level Symanzik improved Wilson action in the 
gauge sector, 



?ti sym = E ( c o E i 1 - ReTr(P^(x))} + Cl £{1 - ReTr(iV(x))} ) .5) 



Here, P flv (x) and R fiV (x) denote path-ordered plaquette and rectangle products of link variables (see 
fig. [T| and the parameters are j3 — Q/g 2 , cq = 1 — 8ci and c\ = 1/12. The unimproved gauge action is 
regained setting c\ to zero. 
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In the fermionic sector we use the so-called twisted mass Wilson fermions [T3j to simulate two 
flavours of fermions, whose matrix on the lattice reads 

7±m)q/3 U± t _ t (x) ab 8 n+ p^ y 

(6) 

with shorthand notation 7_ M = — 7^ and U—^x) = U fJi (x — The 7^ are matrices in Dirac space 
satisfying {7^,7^} = 2<? M „ and a,b,a,/3 are color and Dirac indices, respectively. The sign in the 
diagonal mass matrix MJj ag corresponds to up and down flavour. Two mass parameters appear, the 
twisted mass ix and the usual m (via the hopping parameter k = (2(am + 4)) _1 ). Pure Wilson fermions 
can be rcobtaincd at vanishing twisted mass, -Dwiison = D tm (/i — 0). Twisted mass fermions have 
the advantage that if k is tuned such that the renormalized quark-mass vanishes for pure Wilson 
fermions, the O(a) discretization errors in vanish too. For /1 7^ the mass is then determined 
solely by fi and the system is said to be tuned to Maximal Twist. In addition, for two flavours one 
has det£>tm = det (-D^Viison + 4k 2 /i 2 ). Thus, the twisted mass acts as an infrared regulator, which is 
important in avoiding so called exceptional configurations, for which the fcrmion matrix has very small 
eigenvalues rendering its inversion ill-defined (see below). The most used discretization type besides 
Wilson fermions are staggered fermions. Regarding numerics, these differ a lot from the Wilson type 
since the "staggering" essentially means that on each site there is only one spinor component instead 
of four. 

We now want to dwell on the role the fermion matrix plays in LQCD simulations a bit more closely. 
For simplicity, we will drop the indices on D from now on. To evaluate the fermion determinant in 
the partition function one has to know the inverse of D (see <j3j) ) . In addition, the entries of D _1 are 
related to the fermion propagator, thus its inversion is also crucial for measuring fermionic observables. 
To give an example, the two-point correlation function at sites n and m for a generic meson d(n)Tu(n) 
is given by: 

C r = (u(m)Td(m)d(n)Tu(n)) = - Tr [r(D _1 )(», rr^T^D-^n, m)) t 75 ] . (7) 

For more involved observables, more occurrences of D _1 will appear. 

D is a (Nc x N]3i rac x Vtot)~dimensional sparse square matrix, so one uses iterative Krylov space 
based methods to calculate D -1 indirectly out of equations like 

D(j) = ip =► <j) = £TV . (8) 

During the inversion, the matrix-vector product Dip has to be carried out many times. The most 
common solvers used are CG and BiCGStab. The convergence of the solver depends inversely on 
how light the simulated quarks are, and also scales with Vtot- This renders the inversion the most 
involved part of the simulation. In addition, there is a lot of machinery around to simplify ([8]), such 
as mass-preconditioning and, most prominently, even-odd-preconditioning. 

In a simulation the dimensions of the lattice strongly depend on the physical problem under inves- 
tigation. In order to rule out finite size effects the spatial volume is typically large. For simulations 
of thermal systems, temperature is set by T — l/aN T and thus lattices have a rather small tempo- 
ral extent here. Accordingly, in vacuum simulations, the opposite is the case. On the other hand, 
thermal systems require simulations over ranges of parameter values and higher statistics compared 
to vacuum simulations. Typical lattices have currently N CT ss 32, N r 12 in thermal studies and 
No- w 64, N r w 128 for vacuum simulations. Therefore a thermal lattice is typically simulated using 
less parallelism. 



£>* n = (1 ± 2idK/!7 5 ) 8 X yS a pSab ~ 7T C 1 ~ 
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2.2. Ensemble generation 

The standard simulation algorithm to generate QCD gauge configurations is the Hybrid Monte- 
Carlo (HMC) algorithm [T3], where the effective action is embedded in a fictitious classical system 
governed by the Hamiltonian H = ^P 2 + S c s[U, <j)] by introducing Gaussian momenta P conjugate to U. 
Starting from a given configuration (U, P), the system is evolved over a time r to a new configuration 
({/', P'), according to the hamiltonian equations of motions with a force F, 

P = -dS cB /3U = F , 

U = P . (9) 

This evolution is carried out by numerical integration, the most common integration schemes being the 
leapfrog- and the second order minimal (2MN) scheme. 4* an d P are chosen with a Gaussian distribution 
initially and <fr is held constant throughout the evolution of the system. Since the numerical integration 
is not exact, a metropolis step is carried out in the end, thus ensuring detailed balance. This means 
that the new configuration is only accepted with a probability min (1, exp(H[P' , U'\ — H[P, U])). Here, 
P = P^ix) is a real-valued, (Nq x — 1) dimensional vector, as is the force F — F^(x). The latter 
has three contributions with our choice of lattice action: 

P _ pgauge I pgaugc . pfermion fin-i 

r ^plaquette "T" r rectangles "T" r ) \ LyJ ) 

where the gauge part of the action has been divided into plaquette and rectangle parts. These are 
proportional to sums of link products, so called staples. For example, (F^^ cttc ) ^(x) ~ 'YL^v P^v(x), 
where we denote the staple P^ u (x) as the plaquette product of links without U^(x) (see fig. [IJ. Note 
that these sums are no longer elements of SU(Nc)- 

By means of the identity djjD~ x = D~ 1 (dijD)D~ 1 , the fermion force can be evaluated. Note that 
this means that one needs to invert D for the input of the force calculation. In addition, this part of 
the force receives contributions from the p part of the fermion matrix only since our diagonal matrix 
does not depend on U. 

Techniques exist to refine the integration of the equations of motions, r is usually divided into 
smaller substeps and the inverter can be preconditioned by introducing additional pseudo fermions 
via dct(-D) = det(A) j°t(A) ' wnere A denotes a fermion matrix with a different mass than D (mass 
preconditioning). Furthermore, it is beneficial to integrate cheaper force contributions more often 
(multiple timescales). For details we refer to |15j . 

In many cases, one is interested in QCD-like systems without 
fermions, what is then called pure gauge theory. In principal, this 
L(0) = U gives detM = 1 in the equations above and the HMC algorithm is 

for i < m = # subgroups: still applicable. Nevertheless, the so-called heatbath algorithm 
W = L(i) * Staple [18] provides an alternative and more direct way. It is based on an 

w = project (W,i) exact algorithm for SU(2) that updates a specific link according to 

v = update (w) its neighbours. This can be extended to the general SU(Nq) case by 

V = extend(v,i) systematically reducing the SU(Nc) link to a number of SU(2) sub- 

L(i) = L(i-l) * V groups, which for the Nc = 3 case is usually also 3. Figure [2] shows 

U = L(m) a sketch of the algorithm, where W, U, V, L denote S77(Nc) links and 

w, v SU(2) links, respectively, project and extend denote reduction 
„. „ TT ^ j. , and extension to and from the ith subgroup. The update routine 

figure 2: Heatbath algorithm _ / s 

corresponds to the aforementioned exact SU (2) update or to an over- 
relaxation update |19) . which serves to cover the whole configuration space more quickly. Only in the 
first update random numbers are necessary. 
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AMD Radeon HD 5870 


Cypress 


2720 


544 
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AMD FirePro V7800 


Cypress 


2016 


403 


128 


AMD Radeon HD 6970 


Cayman 


2703 


683 


176 


AMD Radeon HD 7970 


Tahiti 


3789 


947 


264 


AMD FirePro W8000 


Tahiti 


3230 


810 


176 


NVIDIA Tesla C1060 


Tesla 


933 


78 


102 


NVIDIA GeForce GTX 280 


Tesla 


933 


78 


142 


NVIDIA GeForce GTX 480 


Fermi 


1345 


132 


177 


NVIDIA GeForce GTX 580 


Fermi 


1581 


198 


192 


NVIDIA Tesla M2090 


Fermi 


1331 


665 


177 


NVIDIA GeForce GTX 680 


Kepler 


3090 


258 


192 


AMD Opteron 6172 


Magny-Cours 


202 


101 


42.7 


AMD Opteron 6278 


Interlagos 


307 


154 


51.2 


Intel Xeon E5-2690 


Sandy Bridge EP 


371 


186 


51.2 



Table 1: Theoretical peak performance of current GPUs and CPUs. SP and DP denote single and double precision, 
respectively. BW denotes bandwidth 

3. OpenCL and Graphic Cards 

Table [T] shows an overview of available GPUs and CPUs. One sees immediately that GPUs surpass 
CPUs in peak performance as well as in memory bandwidth. However, one also notes the drop in 
performance when going from single to double precision on the GPU. To take advantage of this perfor- 
mance an appropriate programming model and a certain understanding of the hardware architecture 
is required. One of these programming models is provided by OpenCL. 

3.1. OpenCL 

GPU applications generally consist of a controlling program ("host"), running on the CPU, that 
executes suitable smaller programs ( "kernels" ) on the GPU. All the memory handling is performed 
by the host program. The most prominent library of such kind is NVIDIA's CUDA 6J. Virtually all 
existing LQCD applications are based on CUDA, at the disadvantage that these are destined to run 
on NVIDIA hardware exclusively [20Ti24j . A hardware independent approach to parallel computing 
is given by the Open Computing Language (OpenCL [7]), which is an open standard to perform 
calculations on heterogeneous computing platforms. Thus, in OpenCL one is not determined to use 
CPUs or GPUs, but it is in principal possible to combine these and distribute the computations among 
all available computing "devices" (see fig. [3]). For this OpenCL defines a programming language that 
is based on C99. Implementations of OpenCL can be found both from AMD (AMD Accelerated 
Parallel Processing (APP) [55], formerly ATI Stream SDK) and NVIDIA (as part of CUDA), as well 
as other vendors. As LQCD applications are predominantly written in CUDA, OpenCL performance 
results are scarcely found. There is but one reported for a HMC using staggered fermions |22j . where 
a significantly lower performance of OpenCL compared to CUDA is reported. 

3.2. GPU Architecture and OpenCL Terms 

Like modern CPUs, GPUs are multi-core Single Instruction Multiple Data (SIMD) processors. 
However, where CPUs are tuned towards processing each thread as fast as possible, the GPU archi- 
tecture is tuned towards high throughput over thousands of threads. 
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Figure 3: OpcnCL Concept 



While the SIMD model on CPUs is based on vector registers, 
where a single processor executes an instruction in parallel for 
multiple elements in the vector register, the GPUs implement a 
variant known as Single Instruction Multiple Threads (SIMT). 
On the GPU registers are seen as scalar by each thread, however 
the execution units, called "processing elements" in OpenCL ter- 
minology, share a common instruction decoder. Therefore a core, 
called "compute unit" in OpenCL, always executes a group of 
threads in lock-step. The group of lock-stepped threads is basi- 
cally equivalent to the SIMD thread on a CPU. However, the memory access is more flexible due to 
the scalar nature of the thread execution. On CPUs the group executed on a compute unit in lock-step 
might contain only one processing element. 

GPUs provide a larger set of registers than CPUs. The AMD Radeon HD 5870 provides 16384 
registers, each 16bytes in size. The NVIDIA GTX 480 provide 32768 registers of 4bytes each. These 
registers are dynamically mapped to threads. Therefore, a compute unit can run either a smaller 
number of threads using even more than a hundred registers each, or more than a thousand threads, 
each using only a dozen registers each. Like hyper-threading on a CPU, running more threads will 
allow to hide memory latencies, increasing overall throughput. The scheduling of the thread groups 
is performed by a hardware scheduler with minimal overhead. Registers stay allocated to each thread 
from its creation until it finished execution. 

The memory architecture of GPUs is more complex than that of CPUs, which appears uniform 
to the user. It is split into multiple logical regions. Global memory is the normal main memory of 
the GPU that can be read and written to by all threads running on the GPU. Private memory is a 
part of global memory that is partitioned among all threads running on the GPU. When addressing 
into private memory each thread accesses its own partition. This memory is also used as a swapping 
place for registers if the register file cannot hold a threads full working set. Registers swapped into the 
private memory are also known as scratch registers. As private memory is part of the global memory 
it shares the same performance characteristics. This means, large latencies can be incurred when 
accessing data from local memory. Therefore, usage of scratch registers usually comes with a large 
performance penalty. Another part of global memory is constant memory. That memory can only 
be written to from the host. GPUs usually are able to cache accesses to this memory and broadcast 
values from this memory to all threads very efficiently In addition modern GPUs also provide a local 
memory. Just like registers, local memory is on-die and can be accessed with similar performance. 
Local memory is shared between threads running on the same compute unit and can be used as a 
user programmed explicit cache. Allocations of memory on the GPU are referred to as "Buffers" by 
OpenCL. 

Stemming from their graphics tradition, GPUs originally only had dedicated read-only caches for 
constant memory and textures. Textures are images stored in memory in a special format. On the 
AMD Radeon HD 5870 and 6970, when keeping to some restrictions, the AMD OpenCL compiler is 
capable to automatically utilize the texture cache to access buffers which are only read by a kernel. 
More modern GPUs like the NVIDIA GTX 480 and the AMD Radeon HD 7970 provide multi-level 
read- write caches. While CPU caches target to minimize latencies in memory access for a single thread, 
GPU caches are shared by many threads. One of their main functions is to coalesce access by multiple 
threads to close-by addresses into single memory transactions. 



4. Implementation strategy 



Since OpenCL needs a quite different approach to LQCD than existing software, we decided to start 
from scratch instead of modifying an existing application. Consequently, all parts of the simulation code 
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are carried out in OpenCL. We set up the host program in C++, which nicely allows for independent pro- 
gram parts using C++ classes and also naturally provides extension capabilities. The main object is the 
class gaugef ield, where the initialization process of OpenCL is incorporated. It manages the physical 
gauge field when several devices are used and holds an application-specific number of opencl_device- 
objects. This class in turn contains all 
compute device-related parts, such as ker- 
nels, and eventually executes the kernels 
on a specific device. Child classes of 
gaugef ield and opencl_module contain 
problem-related functionality and provide 
algorithmic logic. For hybrid applications, 
different "tasks" can be defined to carry out 
a physics problem. These may then again 
contain a number of device objects them- 
selves. The OpenCL environment has to be 
initialized before the actual calculations can 
be carried out (Fig. |4| . To generate the ker- 
nels, the code files are collected, compiled 
and linked into an OpenCL "program" using the OpenCL compiler. To ease debugging we build each 
kernel as a stand-alone program. The binary files, which the compiler produces during the kernel 
compilation, can be reused at a later point and also provide information about the kernel, e.g. register 
usage statistics, important for benchmarking and optimization. Thus, the kernels are created only at 
runtime of the application, which allows to pass runtime parameters (e.g. NT, NS, . . . ) to them as 
compile-time parameters. We can avoid many kernel arguments like this, and parameters are "hard 
coded" into the kernel code. On the device all data types are implemented as structures, with all 
the required operations defined for them. Although this might in some cases require more registers 
than simply operating on arrays of scalars stored in main memory, we opted for this implementation 
strategy for its better code readability. 

It is not trivial to estimate how much the register overhead of the structure based implementation 
strategy is, as the exact register requirements highly depend on the optimizer. The possibly higher 
register requirements of the structure based approach can easily be seen when looking at the addition 
of two structures of four floats. When operating on arrays of scalars there only has to be space for 
three floats in registers. As addition is element-wise, only one float from each operand has to be 
loaded at the same time. Additionally there has to be room for one element of the result. Using 
actual structures four times this space is required, as the whole structure has to be stored for each 
operand and the result. In the case of spinors register requirements would increase from three floats 
to 72. This is, however, a worst case situation as for most operations, e.g. multiplication of a SU3 
matrix with a spinor, more than one element of each structure is required at the same time anyway. 
In addition, given the high latency of GPU memory it does not make sense to completely serialize 
handling of each element in a structure. Therefore the register usage should be higher even when 
performing all operations using scalar types, as the optimizer will use different registers for different 
elements to enable the exploitation of instruction-level parallelism. 

As up till now they have not been relevant for the overall runtime, all LAPACK operations have 
been implemented in a straightforward manner. The only optimization was implicitly given by the 
data type storage format which is described below. ILDG compatible I/O has been implemented as 
well as the Ranlux j^B] PRNG (Pseudo Random Number Generator), as it is the standard choice 
for LQCD simulations. We use the original implementation on the host while on the device we use 
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Figure 4: Schematic flow of program initialization 
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RANLUXClQ an open-source OpenCL implementation of Ranlux. For testing purposes we have also 
implemented the NR3 [27] generator, but it has not been used for the measurements in this paper. 
Initialization of the random number generator follows the usual Ranlux rules which are applied across 
the host and the device, where each OpenCL thread runs on its own Ranlux PRNG state. 

Since on different GPU drivers we have observed multiple miscompilations of our code during 
development, we added regression tests for most of our OpenCL functionality. This allows us to 
quickly check new drivers for incorrect output. A special challenge is that the likelihood for compiler 
errors scales with code complexity. Thus, a function might work perfectly in a simple test case but 
will produce errors when integrated into a larger kernel. Therefore it is important to not only test 
each building block for regressions, but also repeatedly check whether they still work as expected when 
being used in larger kernels. 

4-.1- Memory requirements 

LQCD calculations are always memory bound. To see that consider the theoretical peak perfor- 
mances collected in table [T] and the characteristics of some kernels displayed in table [2] Especially for 
the fermion related kernels, looking at the respective ratios of bandwidth to flops (numerical density) 
shows how the memory bandwidth dominates performance. One the other hand, the heatbath related 
kernels may be expected to perform well on CPUs, since here the bandwidth is somewhat less impor- 
tant, compared to the flops the kernels perform. However, the numerical density is still low compared 
to the peak flops per bandwidth ratio of the GPUs. 

Table [3] shows that the gauge field is the biggest object to store. In addition, one needs more than 
one field of each species during the simulations. However, there are possibilities to reduce these sizes, 
most prominent the aforementioned even-odd preconditioning for the spinorficlds. In general, each 
gauge link may be displayed with — 1 real numbers, one for each generator of the group. Since 
this representation induces additional computational overhead, other methods are more feasible. For 
Nc = 3, one column of the matrix can be reconstructed from the other two exactly, since it has to 
obey c = a x b , so one can discard 6 floats of the full matrix (REC12). Additional restrictions on 
the left 12 numbers allow to save 2 or 4 more floats (REC10 and REC8, respectively). However, the 
last procedure may run into arithmetic problems during the reconstruction |20j . As the reconstruction 
adds additional complexity to the code we have not used it so far except for the p kernel. Most of our 
kernels are already at the edge of the register limit and additional code complexity tends to increase 
the register pressure, even if with perfect register reuse no additional registers would be required. 

4.2. Global memory storage formats 

Our first implementation of the p kernel used an array of structures (AoS) storage format for the 
gauge and spinor fields. Observing this kernel reaching only about 22GB/s of memory throughput on 
the AMD Radeon HD 5870 we started studying characteristics of the memory controller. 

First we checked the effects of utilizing the texture cache, which on AMD hardware can be done 
with minimal code modification. While this provides a significant speedup, the achieved 46GB/s are 
still far from that GPUs theoretical bandwidth limit of 155GB/s. 

Another speedup can be reached by specifying a proper alignment for the larger data types. The key 
is to use largest possible alignment that is still a divisor of the data type size. If no alignment is specified 
the compiler will use the alignment of the smallest contained data type, resulting in superfluous memory 
fetches. While the number of fetches in some cases could be reduced even further by using an alignment 
that is larger than the data type, it turns out that the additional bandwidth required overcompensates 
the benefit. Using proper alignment of all types the code reaches 68GB/s, which is still far from the 
performance limit. 



1 https: / /bitbucket .org/ivarun / ranluxcl 
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Figure 5: Copy performance using different data types to copy a buffer of lOOMiB. Note that the AMD Radeon HD 
5870 was run using the Catalyst 11.7 driver, while all other measurements where performed using Catalyst 12.4. 



Figure [5] shows how using different data types for global memory access effects the speed of copying 
a fixed size buffer on the GPU. The float and double types are scalar data types of 4 bytes and 8 bytes 
size. The float4 type is a built-in data type of OpenCL that acts like a struct of four floats and 
therefore has a size of 16 bytes. The types su3 and spinor are complex structs of size 144 bytes and 
192 bytes, respectively. The figure shows that the best way to access memory is using the double or 
float4 data type. Actually all other types with the same size and alignment will work, too. Therefore 
we now use a double precision floating point complex type as the base storage type for all our data in 
GPU global memory. From size and alignment this type is equivalent to the float4 type. All other 
types are stored in a SoA fashion based on this type. This allows to utilize memory bandwidths of 
up to 120GB/s on the AMD Radeon HD 5870. Care has to be taken, though, as SoA increases the 
register pressure. This can lead to register spilling and therefore cause a performance penalty. 

Another important effect is given by the mapping of the gauge field indices to memory addresses. As 
we are using even-odd preconditioning, neighboring threads will always access either links originating 
in even or those originating in odd sites only. If links originating in even sites are stored interleaving 
with those originating in odd sites, which is the naive format, and a SoA pattern is used, a memory 
access will span data of both types of links. As neighboring threads will only require one type of link 
at that moment, half the memory bandwidth will be wasted. Therefore we split our gauge field such 
that links originating in even sites are stored separate by those originating in odd sites. 

4-.S. Common code for CPUs and CPUs 

As OpenCL can be used both for CPU and GPU programming, we use a single source code for 
the CPU and the GPU implementation. To cater for the different architectures, we introduces some 
abstractions to achieve best performance in both worlds. 

An important difference between the CPU and the GPU is the optimal looping strategy. On CPU 
loops perform best when each CPU works on its own consecutive block of memory. Thereby, it can 
best utilize its time-local cache to reduce the number of actual requests performed on the memory. 
On the GPU however, the best memory throughput is achieved if consecutive cores read consecutive 
elements from memory. Therefore, a loop should always move though the data in a strided way. We 
use a macro PARALLEL_F0R to implement these different looping strategies transparently. 



10 



kernel 


RW-Size 


Flops 


general size [V to t] 


Bytes [Vtot 


P 


2880 


1632 




N D irac x N c x C 


192 


pgaugc 
plaqucttc 


2800 


2717 




(N Dirac x N c x C)/2 


96 


pgaugc 
rectangles 


13168 


14813 


U 


N 2 c xN D xC 


576 


-^fcrmion 


4736 


1748 


^REC12 


2N C x N D x C 


384 


heatbath 


2880 


3912 


t^REClO 


(2N C - 1) x N D x C 


320 


overrelax 


2880 


3846 


C^REC8 


(2N C - 2) x N D x C 


256 


saxpy 


608 


64 


P,F 


(N£ - 1) x N d 


256 



Table 2: Read- Write (RW) sizes in Bytes and flop sizes Table 3: Overview over memory requirements of 

for some HMC related kernels (for double precision). LQCD quantities in double precision per site (and di- 

All numbers are per site (and direction), "saxpy" cor- rection). C denotes the size of one complex number (2 

responds to the algebraic operation z = ax + y. real numbers). 

Another important difference between CPU and GPU is that the GPU prefers a SoA pattern, while 
the CPU tends to prefer an AoS pattern. Therefore we encapsulated all memory accesses into separate 
functions which transparently perform the SoA conversion if required. When moving data onto or 
from a device the same automatism occurs, ensuring best memory access patterns on all devices. 

While OpenCL provides vector data types, which the AMD platform uses for vectorization on the 
CPU, we did not use those in our code. Besides complication of the source code, they would increase 
the amount of registers required on the GPU which are already a sparse resource. 

5. Performance results 

In this section we test our implementations of the heatbath and HMC algorithm and in addition 
report on the performance of the p, the crucial time consuming part for fermionic observables and 
the HMC itself. We tested our LQCD implementations on different architectures with different lattice 
sizes and compared the results to existing applications and literature data. Since we are interested in 
thermal systems, we benchmarked our programs on lattice sizes ranging from N CT = 16, 24, 32, 48 and 
N T = 4, 8, 12, 16. The p and heatbath kernels have also been benchmarked for some additional values 
of N T to explore a wider range of used memory sizes. 

As double precision results are desired for physical measurements we concentrated on such precision 
calculations throughout. However, single precision calculations do promise twice the performance for a 
given device's memory bandwidth and have provided good results in other applications running on the 
same GPUs[3H]. Therefore our results are only a lower bound for the achievable performance. For the 
heatbath algorithm, where no summations over the whole lattice are required, single precision results 
are given. 

The CPUs used are an AMD Radeon HD 5870, as it is used in the LOEWE-CSC, and an AMD 
FirePro V7800, which is the professional grade version of the AMD Radeon HD 5870. It is clocked 
somewhat lower and equipped with more memory. In addition we also used the newer AMD Radeon 
HD 6970 and the AMD Radeon HD 7970, which is the latest GPU available by AMD. For comparison, 
the p benchmarks have also been executed on the NVIDIA GeForce GTX 480, which is of comparable 
age to the AMD Radeon HD 5870, and the NVIDIA GeForce GTX 680, which is the latest GPU 
available by NVIDIA. To show the flexibility of OpenCL, we also simulated on Intel Xeon E5520, 
AMD Opteron 6172 and AMD Opteron 6278 CPUs. All of the benchmarks, except those on the AMD 
Radeon HD 7970, used Catalyst 12.4 which is the most current available AMD GPU driver available 
at the time of writing. On the AMD Radeon HD 7970 the Catalyst 12.2 was used. The NVIDIA CPUs 
were using version 295.41 of NVIDIA's GPU driver. 
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Figure 6: Performance of the single precision hcatbath and overrelax kernels. 



5.1. Heatbath performance 

First, we present results of our implementation of the heatbath algorithm, that is we concentrate 
on the heatbath and overrelax kernels. Figure [6] shows the performance in GFLOPS achieved in single 
precision. While we don't manage to saturate device bandwidth, in both kernels the Cypress and 
Cayman based GPUs provide at least 30GFLOPS for the heatbath kernel for any lattice size. The 
overrelax kernel performs a lot better, here we can achieve at least 100GFLOPS on the GPU, peaking 
up at about 160GFLOPS on the AMD Radeon HD 6970. This behavior can most likely be explained 
by how the compiler manages with the random number generator, since this is the essential difference 
between both kernels. One can also see the different performance of different card generations. CPU 
performances are also given, but they are only a factor 1/4 of those on the GPUs. However, the 
overrelax kernel again outperforms the heatbath kernel. The comparison is not entirely fair, as the 
CPU code did not receive any optimizations besides the proper looping strategy and is not using a 
SoA access pattern. Vectorization might be able to close the gap, although one should keep in mind 
that this is a comparison of two server CPUs to one consumer GPU, where the latter is a much less 
costly solution. 

Not shown is the double precision case, where on the contrary, the performance is of the order 
of 10GFLOPS and thus poor for both kernels on all GPUs. Again, the overrelax kernel performs 
better than the heatbath kernel, however, this time the difference is small. The massive drop in 
performance can be understood when looking at the working set sizes of the kernels. Both kernels 
require storing of a relatively large amount of data throughout kernel execution, rendering register 
reuse difficult. Therefore, in double precision the working set for each thread gets too large to fit 
into the available registers, leading to massive spilling to scratch registers. This reduces performance 
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by both, the additional latency when accessing the spilled data, as well as the increased memory 
bandwidth required to access the spilled data. Therefore, effective algorithmic density is less than 
what is given in Table [2j However, this is also specific to heatbath and overrelax kernels, which can 
be seen in the performances reported below. 

Despite performance, the limiting factor for physical studies here is clearly the GPU main memory. 
It is limited to 1GB on the AMD Radeon HD 5870 and to 2GB on the other two. This sets an upper 
limit for possible lattice sizes. Additionally, on AMD GPUs prior to the HD 7000 generation there is 
no official support to use more than half of the GPUs memory from an OpenCL application. The limit 
can however be circumvented by setting environment variables as specified by the AMD Knowledge 
Base [29 . Also note that an OpenCL runtime can do a multitude of things when it runs out of physical 
memory on the GPU. While current implementations all seem to give an error when running out of 
memory, from the API specification it is perfectly legal to swap buffers to host memory, which will 
come with a major performance penalty. 

Comparing the heatbath kernel performance with |24j . where performances of a CUDA based 
heatbath implementation on NVIDIA GeForce GTX 295 and NVIDIA GeForce GTX 580 is given, 
we observe only half of the reported single precision performances for the heatbath kernel. The 
overrelax kernel performs at around 90% of the peak value of [53] for the AMD Radeon HD 5870 and a 
slightly better performance for the AMD Radeon HD 6970 is found. Note that we do not use memory 
bandwidth reducing storage techniques for the links as they are used in [23J (REC12). Other than the 
code in [24] . our implementation can work with any number of lattice points that fit into the GPU 
memory and is not limited by the number of threads used. This shows that all the tested GPUs can 
in principle be used to efficiently run the algorithm. 

5.2. p performance 

For simulations including fermions the inversion of the fermion matrix is essential and thus the 
performance of evaluating Defy is crucial. Since for twisted mass fermions the diagonal part of D is not 
dependent on the gauge field, the p performance has to be considered most carefully. We measured 
it on various lattice sizes in double precision, the results can be seen in figure [7] 

The slowest of our GPUs, the AMD FirePro V7800 levels at nearly 60GFLOPS and is above 
50GFLOPS even for small lattices like 16 3 x 4. The gamer GPUs AMD Radeon HD 5870 and AMD 
Radeon HD 6970 scale nicely with their higher peak memory bandwidth, achieving about 60GFLOPS 
and 80GFLOPS, respectively. In addition, one can observe better performances with the AMD Radeon 
HD 5870 for lattices sizes that have a spatial extent that is a multiple of 16, where the kernels performs 
at about 70GFLOPS. The performance of the AMD Radeon HD 7970 is not better than that of the 
AMD Radeon HD 6970, however this GPU was measured using a preview driver. As that driver does 
not provide any useful resource usage statistic we did not optimize our kernel for that GPU. As we 
cannot ensure proper register usage we simply reused the memory stride optimization found to be 
optimal on the AMD Radeon HD 5870. 

Also shown is the bandwidth utilization, where apart from the unoptimized AMD Radeon HD 7970 
at least 70% of the theoretical peak bandwidth are reached on all AMD GPUs. On the AMD Radeon 
HD 6970, when running at 140GB/s, 80% of the peak bandwidth are achieved, peaking up to 86% for 
lattices like 16 3 x 24. For the AMD Radeon HD 5870, the utilization peaks at 120GB/s (about 78% 
of the theoretical peak bandwidth), which is actually even higher than the about 105GB/s achieved 
on this GPU in a simple float4 copying kernel as shown in figure [5j This can be explained by the 
loop-unrolled characteristic of the SoA memory access and the higher read-to-write ratio of the p 
kernel, as only reads can benefit from the cache of the AMD Radeon HD 5870. 

Additionally, we give p performances obtained on the NVIDIA GPUs GeForce GTX 480 and 
GeForce GTX 680. Both cards constantly reach about 25GFLOPS and 40GB/s, respectively. The 
latter corresponds to about 20% of the theoretical peak. The NVIDIA GPUs show register spilling, 
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Figure 7: Performance and memory bandwidth utilization of the JZ) kernel on several GPUs (double precision). 



which is probably the performance limiting factor. While the NVIDIA GPUs have a smaller register 
file than their AMD counterparts, that is only part of the problem. In CUDA it is possible to have the 
compiler increase the number of available registers at the cost of fewer threads being able to run on the 
GPU concurrently. In addition, it is possible to reconfigure the ratio of LI cache to shared memory. A 
larger LI cache would probably avoid some of the slowdown due to the spilled registers. Sadly neither 
of these features is exposed when using OpenCL on the NVIDIA hardware. With the current code, 
the AMD GPUs achieve 3-4 times more performance with the same kernel, but the comparison is not 
entirely fair, as the AMD GPUs have been the primary development platform. 

There are a couple of reported performances in the literature, which are all based on CUDA 
implementations. Using the AMD Radeon HD 5870 as our reference value, our p kernel is nearly 
twice as fast as the one shown in [2D], even though that kernel uses REC12 to reduce bandwidth 
requirements. One should however keep in mind that the NVIDIA GeForce GTX 280 is one generation 
older than the AMD Radeon HD 5870. Our p is also 40% faster than that shown in [23]. That 
performance was measured on a NVIDIA GeForce GTX 480 which, as mentioned above, is as old as 
the AMD Radeon HD 5870 used by us. 

Furthermore, figure [8] shows the effect of REC12 on the p performance exemplarily for the AMD 
Radeon HD 5870 and the AMD FirePro V7800. This technique reduces the RW load of the kernel 
by 13% (see table |3j), and thus should result in a visible speedup. An increase in performance up to 
9% can indeed be observed and proves again that the performance is bandwidth limited. Note that 
we did not include the computational overhead induced by the reconstruction technique in the FLOP 
count. Compared to |20j . where actually a decrease in double precision performance is reported when 
decreasing the RW load, these results support the good performance picture of our p implementation. 
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Table 4: Setups used for benchmarking. f3 = 
twist. The value of is only approximate. 



3.9 and n = k c 



0.160856 were used throughout to simulate at maximal 



5.3. HMC performance 

In order to test our HMC program under realistic conditions, we used setups corresponding to one 
heavy and two lighter pion masses, see table [4] The parameters were chosen according to |30j in order 
to simulate at maximal twist. In each setup, we started from a prior generated gauge configuration 
and performed 10 HMC steps with r = 1 (Setup C) and r = 0.1 (Setup A and B), respectively. 
We used separate time scales for the gauge and fermion parts and on each the 2MN integrator with 
10 integration steps. To gain statistics, each run has been carried out O(10) times. The OpenCL 
application was run on the AMD FirePro V7800 and the AMD Radeon HD 6970. For comparison, we 
performed the same measurements using the CPU-based application tmlqcd |31) . running on different 
numbers of cores on the LOEWE-CSC (AMD Opteron 6172 CPUs). Statistical errors have been 
neglected since they were found to be below the percentage limit. 

The results for setup C are shown in figure |9j where runs for fixed N T = 8 and fixed N a = 24 have 
been performed. The reference tmlqcd was executed on one and two whole nodes, which means 16 and 
32 and 24 and 48 CPU cores, respectively, for the different lattice sizes, since the lattice extent has to 
be divisible by the number of cores here. 

One can see nice scaling in all reference runs, the additional MPI overhead is only visible for small 
lattices. The GPU runtimes are always comparable to or better than the reference values. Especially 
the AMD Radeon HD 6970 can achieve a better performance than 32 and 48 CPU cores throughout, 
respectively. Considering the N CT = 24 runs, we measure a speedup of the AMD FirePro V7800 of 
about 1.7 compared to the 24 core reference runs and a slow-down of 0.9 compared to the 48 core runs. 
The AMD Radeon HD 6970 is faster in both comparisons, about 2.5 and 1.3, respectively. Note that 
this is a comparison to two and four full CPUs, where each CPU is more expensive then the CPUs 
used. We did not perform reference runs on a single CPU core, but we observed a scaling of about 0.75 
when running tmlqcd on more than one core, which can be explained by the additional MPI overhead. 
This scaling yields a speedup of about 30 for the AMD FirePro V7800 and a speedup of about 44 for 
the AMD Radeon HD 6970, compared to one AMD Opteron 6172 core. However, this comparison is 
a bit academic, since one would practically not use only one core of a multi core CPU. 

A similar picture emerges also for the lighter pion masses (see figure 10), where the runtime scales 
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Figure 9: HMC runtimes in seconds for setup C (see table[4j| for fixed N T = 8 and N CT = 24, respectively. 



approximately the same on all systems. For these setups, the inversion of the fermion matrix takes 
longer. Therefore the force calculation, which only achieves about half the bandwidth utilization of the 
p kernel and takes 40% of the total execution time in setup C, becomes less important for the overall 
performance. The limited performance of the force calculation is also caused by register spilling, even 
though it is not as problematic as in the similar heatbath kernels. 

Not displayed are additional runs of the OpenCL code which we did on AMD Opteron 6172 CPUs. 
Since no specific optimizations for the CPUs have been carried out, these needed significantly more 
runtime. However, switching between both types of devices is smooth, showing that OpenCL code is 
absolutely versatile. 

The comparison of these results to the extensive tests of the CUDA based HMC for the staggered 
fermion discretization reported in [2 2) is somewhat difficult because of the fundamental differences 
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Figure 10: HMC runtimes in seconds for setups A, B, and C (see table [4| for a lattice with N r = 8 and N CT = 24. 
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between the two discretizations. Additionally, the authors of |22) mainly use single precision within 
the HMC, whereas we perform double precision calculations throughout. Although we did not test 
this, the results obtained for the p on NVIDIA cards strongly suggest that our code would currently 
perform significantly slower on these cards compared to AMD GPUs. This is the opposite picture to 
the significantly slower performance of the staggered HMC reported in [22]. Just like our NVIDIA 
results their AMD results are probably affected by the choice of the primary development platform, 
resulting in less optimization work in the other. In addition the staggered fcrmionic data types should 
be smaller, reducing the working set and therefore avoiding register spilling. This should especially 
benefit the NVIDIA GPUs used, as their register file is smaller than that of the AMD GPU. 

6. Conclusions 

We have presented the first implementation of LQCD using OpenCL and shown it is working on 
CPUs as well as the two major GPU platforms, NVIDIA and AMD. 

In single precision our heatbath implementation is able to achieve a competitive 160GFLOPS on 
the AMD Radeon HD 6970, while the double precision variant is still work in progress. Fine tuning 
the applied memory optimizations, which are currently tuned for double precision codes, and the use 
of bandwidth-reducing techniques like REC12 should provide further speedups here. 

For the Wilson p kernel we were able to show excellent performance, utilizing more than 70% of 
the available memory bandwidth for all lattice sizes on multiple AMD GPUs outperforming published 
performances of CUDA based codes. We also see a positive effect of REC12 on the performance. 
Extended to a full HMC for twisted-mass Wilson fermions we showed a speedup factor of four of our 
code running on an AMD Radeon HD 5870 compared to a reference code running on an AMD Opteron 
6172. 

Further speedups to our HMC will be reached by further optimizing the inverter performance. One 
way to reach this will be a mixed precision solver, which up till now we avoided due to the small 
memory of the AMD Radeon HD 5870. As proposed in [T] we will also start to use all compute 
resources on hybrid systems by performing parts of the calculations on the CPU. This will require the 
SIMD capabilities of the CPUs, too. One approach for this could be the redefinition of our base types 
for each device. In addition, we are currently investigating possible performance gains by using REC12 
in the force calculation. More complex reduction techniques may also be of advantage here. Finally, to 
reduce processing time and circumvent memory limitations we will expand our code to run on multiple 
GPUs. 
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